Sample covariance between variables \(x=x_t\) and \(y=y_t\):
\[ c_{xy} = \frac{1}{N} \sum_i^N (x_i-\overline{x}) (y_i-\overline{y}) \]
Sample cross-covariance function for positive values of lag between variables \(x_t\) and \(y_{t+k}\) (Chatfield, The Analysis of Time Series, 2004):
\[ c_{xy}(k) = \frac{1}{N} \sum_{t=1}^{N-k} (x_t-\overline{x})(y_{t+k}-\overline{y}) \]
Pearson’s correlation coefficient (sample correlation) is defined as the covariance of two variables divided by the product of their standard deviations (which are the square roots of their respective variances):
\[ r_{xy} = \frac{c_{xy}}{\sqrt{c_{xx}c_{yy}}} %= \frac{\sum (x_i-\overline{x})(y_i-\overline{y})}{\sqrt{ \sum (x_i-\overline{x})^2 \sum (y_i-\overline{y})^2 }} \]
The sample cross-correlation function: \[ r_{xy}(k) = \frac{c_{xy}(k)}{\sqrt{c_{xx}(0)c_{yy}(0)}} \]
\(c_{xx}\) and \(c_{yy}\) are the sample variances of \(x\) and \(y\), respectively.
\(c_{xx}(0)\) and \(c_{yy}(0)\) that are the sample variances of \(x_t\) and \(y_t\) respectively.
“For descriptive purposes, the relationship will be described as strong if \(|r| \geq .8\), moderate if \(.5 < |r| <.8\), and weak if \(|r| \leq .5\).” – Devore and Berk, Modern Mathematical Statistics with Applications, 2012
Anscombe’s quartet classically illustrates the pitfalls on relying on a single coefficient – always visualize your data. Consider the following four datasets:
All have similar statistical properties.
| set | mean of x | mean of y | variance of x | variance of y | correlation | intercept | slope |
|---|---|---|---|---|---|---|---|
| 1 | 9 | 7.5 | 11 | 4.127 | 0.82 | 3 | 0.5 |
| 2 | 9 | 7.5 | 11 | 4.128 | 0.82 | 3 | 0.5 |
| 3 | 9 | 7.5 | 11 | 4.123 | 0.82 | 3 | 0.5 |
| 4 | 9 | 7.5 | 11 | 4.123 | 0.82 | 3 | 0.5 |
Illustration of lag for 20.07.2013 in Lausanne:
Could a similar relationship be confirmed by examining correlations between the daily maximum values of radiation and ozone?
A correlation or lagged correlation in x and y may also be observed. For examples, a correlation between \(x\) and \(y\) may not be due to the causal relationship between \(x\) and \(y\), but dependent on a third variable, \(z\). This is written:
|
\[ \begin{aligned} y_t &= f_y(z_t)\\ x_t &= f_x(z_t) \end{aligned} \] |
\[ \begin{aligned} y_{t+k} &= f_y(z_t)\\ x_{t} &= f_x(z_t) \end{aligned} \] |
“Correlation does not imply causation”:
library(dplyr)
library(reshape2)
library(chron)
library(ggplot2)
source("GRB001.R")
Sys.setlocale("LC_TIME","C")
options(stringsAsFactors=FALSE)
options(chron.year.abb=FALSE)
theme_set(theme_bw()) # just my preference for plots
Let us load data saved from Lesson 4.
df <- readRDS("data/2013/lau-zue.rds")
variables <- c("O3", "NO2", "CO", "PM10", "SO2", "NMVOC", "EC", "TEMP", "PREC", "RAD")
Let us calculate the daily maximum:
lf <- melt(df, measure.vars=variables)
dm <- lf %>% group_by(site, year, month, day, season, variable) %>%
summarize(value=max(value, na.rm=TRUE))
daily.max <- dcast(dm, site + year + month + day + season ~ variable)
rm(dm) # no longer needed
Recall the relationship between temperature and O3 shown in a previous lecture. Note the seasonal dependence of this relationship.
ggp <- ggplot(daily.max)+
facet_grid(site~season)+
geom_point(aes(TEMP,O3))
print(ggp)
We can examine the correlation of other pollutants or meterological variables with O3.
lf <- melt(daily.max, measure.vars=setdiff(variables, "O3")) # everything but ozone
ggp <- ggplot(lf)+
facet_grid(site~variable, scale="free_x")+
geom_point(aes(value, O3, group=season, color=season), shape=4)
print(ggp)
ggp <- ggplot(df)+
facet_grid(site~season)+
geom_point(aes(CO, NO2))
print(ggp)
For the following scatterplot matrix, we will use a package called lattice which is a plotting system that exists in parallel to R’s base graphics and ggplot2 package (confusing, I know). We additionally define a function to include correlation coefficients in our panels.
library(lattice)
Correlation.value <- function(x, y, ...) {
correlation <- cor(x, y,use="pairwise")
if(!is.na(correlation)) {
cpl <- current.panel.limits()
panel.text(mean(cpl$xlim),mean(cpl$ylim),
bquote(italic(r)==.(sprintf("%.2f",correlation))),
adj=c(0.5,0.5),col="blue")
}
}
Plot daily maximum values as pairwise points (only Lausanne).
ix <- grepl("LAU", daily.max[["site"]], fixed=TRUE)
spp <- splom(~daily.max[ix,c("O3","NO2","CO","PM10","TEMP","PREC","RAD")] | daily.max[ix,"season"],
upper.panel = Correlation.value,
pch=4)
print(spp)
Plot hourly data as pairwise points. Given the large number of points, we can “smooth” the visual representation.
ix <- grepl("LAU", df[["site"]], fixed=TRUE)
spp <- splom(~df[ix,c("O3","NO2","CO","PM10","TEMP","PREC","RAD")] | df[ix,"season"],
upper.panel = Correlation.value,
panel = panel.smoothScatter,
pch=4)
print(spp)
Notice some values of correlation went up.
Lag <- function(pair, k) {
out <- data.frame(k, head(pair[,1],-k), tail(pair[,2],-k))
names(out) <- c("lag", colnames(pair))
out
}
lagged <- df %>%
group_by(site, season) %>%
do(rbind(Lag(.[,c("RAD","O3")], 1),
Lag(.[,c("RAD","O3")], 2),
Lag(.[,c("RAD","O3")], 3),
Lag(.[,c("RAD","O3")], 4),
Lag(.[,c("RAD","O3")], 5),
Lag(.[,c("RAD","O3")], 6)))
ggp <- ggplot(lagged) +
geom_point(aes(RAD, O3, group=site, color=site), shape=4)+
facet_grid(lag~season)
print(ggp)